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Synonymous codons, i.e., DNA nucleotide triplets coding for the same amino acid, are used 
differently across the variety of living organisms. The biological meaning of this phenomenon, 
known as codon usage bias, is still controversial. In order to shed light on this point, we propose a 
new codon bias index, CompAI, that is based on the competition between cognate and near-cognate 
tRNAs during translation, without being tuned to the usage bias of highly expressed genes. We 
perform a genome-wide evaluation of codon bias for E.coli , comparing CompAI with other widely 
used indices: tAI, CAI, and Nc. We show that CompAI and tAI capture similar information 
by being positively correlated with gene conservation, measured by ERI, and essentiality, whereas, 

CAI and Nc appear to be less sensitive to evolutionary-functional parameters. Notably, the rate 
of variation of tAI and CompAI with ERI allows to obtain sets of genes that consistently belong 
to specific clusters of orthologous genes (COGs). We also investigate the correlation of codon bias 
at the genomic level with the network features of protein-protein interactions in E.coli. We find 
that the most densely connected communities of the network share a similar level of codon bias (as 
measured by CompAI and tAI). Conversely, a small difference in codon bias between two genes is, 
statistically, a prerequisite for the corresponding proteins to interact. Importantly, among all codon 
bias indices, CompAI turns out to have the most coherent distribution over the communities of the 
interactome, pointing to the significance of competition among cognate and near-cognate tRNAs for 
explaining codon usage adaptation. 


INTRODUCTION 

The genetic information carried by the mRNA and then translated into proteins is encoded into nucleotide triplets 
called codons. Four alternate nucleotidic bases (A,U,C,G) compose the mRNA, so that there are 4 3 = 64 possible 
codons that have to code for only 20 naturally occurring amino acids. The genetic code is therefore redundant: while a 
few amino acids correspond to a single codon, most amino acids can be encoded by different codons. Different codons 
coding for the same amino acid are known as synonymous codons, and in a wide variety of organisms synonymous 
codons are used with different frequencies—a phenomenon known as codon bias. With the advent of whole-genome 
sequencing of numerous species, genome-wide patterns of codon bias are emerging in the different organisms. Various 
factors such as expression level, GC content, recombination rates, RNA stability, codon position, gene length, environ¬ 
mental stress and population size, can influence codon usage bias within and among species [l[. While the biological 
meaning and origin of codon bias is not yet fully understood, there is a large consensus that the degeneration of 
the genetic code might provide an additional degree of freedom to modulate accuracy and efficiency of translation 
f2j. Indeed, population genetic studies Q have shown that synonymous sites are under weak selection, and that 
codon bias is maintained by a balance between mutation-selection (random variability in genetic sequences followed 
by fixation of the optimal codons) and genetic drift (allowing for the occurrence of non-optimal codons). In fact, 
highly expressed genes feature an extreme bias by using a small subset of codons, optimized by translational selection 
[j-[g|. On the other hand, the persistence of non-optimal codons in less-expressed sequences causes long breaks during 
protein synthesis; this could be the result of genetic drift and have a key role in the protein folding process [7, : 8|. 
In addition, codon usage appears to be structured along the genome, with neighboring genes having similar codon 
compositions [§], and codon bias seems positively correlated to gene length (as a result of selection for accuracy in 
the costly production of long proteins) [Tq] . In the last years there has been a wide effort in developing effective ways 
to measure codon bias [llj . The most widely used indices include the Codon Adaptation Index ( CAI) [12} , the tRNA 
Adaptation Index (tAI) and the Effective Number of Codons (Nc) [HI, each of them having specific advantages 
and drawbacks. For instance, CAI and tAI correlate well with gene expression levels, however such correlation is a 
natural consequence of their definition: they are tuned on a reference set of highly expressed genes. Nc is instead 
basically a measure of the entropy of the codon usage distribution, and thus shows a lower correlation with expression 
levels. 
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In this work we propose a novel codon bias index named Competition Adaptation Index ( CompAI ), which does not 
rely on information about gene expression levels, but instead has a self-consistent biological meaning—based on tRNA 
availability and competition between cognate and near-cognate tRNAs. In other words, CompAI is a parameter-free 
index that does not require a set of reference genes for its calibration, a fact that constitutes its main advantage with 
respect to CAI and tAI. Moreover, CompAI is designed to extract genetic signals that could be directly correlated to 
experimental measures for translation speeds, an emerging and challenging issue still to be explored. In order to show 
the advantage of the novel codon bias index, we perform a genome-wide comparison of CAI , tAI , Nc and CompAI 
for Escherichia Coli (E.coli). Our analysis reveals that the information on gene conservation across species and gene 
essentiality is better captured by codon bias metrics that build on tRNA availability (tAI and CompAI). We also 
study codon bias in relation to the connectivity patterns of the protein-protein interaction network (PIN) |15[ of E.coli. 
We thus show that translational selection systematically favors proteins with the highest number of interactions and 
belonging to the most densely connected community of the network, at least when the bias is measured by CompAI 
and, to a smaller extent, by tAI. Additionally, we address the issue of how much a similarity in the codon usage bias 
of a set of genes is reflected on the interactions among the corresponding proteins. A principal component analysis 
for the variability of codon bias indices indeed reveals that closeness of a set of genes in the space of the two principal 
components likely results in the corresponding proteins to interact- in comparison with an appropriate null model. 

Overall, our study reveals that CompAI captures more information than the other indices about the connection 
between codon bias and the topology of the interactome. Besides, we recall that CompAI does not require calibration 
on gene expression levels and has a consistent biological meaning based on the competition between cognate and 
near-cognate tRNAs. These observations stress the potential of the new index to both measure and explain codon 
usage bias. 


MATERIALS AND METHODS 
Sequences 

In this work we investigate the genome of E.coli K-12 substr. MG1655, whose 4005 coding mRNA sequences have 
been collected from GenBank |l6(. The gene copy numbers coding for each tRNA (tGCN) were derived from the 

Genomic tRNA database [13- 


Conservation and Essentiality of E.coli genes 

In order to have an index for gene “conservation”, we use the normalized Evolutionary Retention Index (ERI) [Il| : 
for each gene in E.coli’s genome, its ERI measures how much that gene is shared among other 32 bacterial species 
(having at least an ortholog of the given gene). A low ERI value thus denotes that a gene is specific to E.coli, whereas, 
high ERI is characteristic of highly shared (and therefore conserved) genes. Concerning gene “essentiality”, we use 
the classification of Gerdes et al. [l8| for the E.coli genome into 606 essential and 2940 non-essential genes, based on 
experimental measures of gene resistance against transposon insertion. 


Codon Bias Indices 

Codon usage bias can be assessed, for each gene in a given genome, by various indices that can be classified into 
broad groups based on: (i) codon frequencies; (ii) reference gen e sets; (iii) deviation from a postulated distribution; 
(iv) information theory; (v) interactions among tRNAs (see [11] for an overview). We focus here on the most widely 
used indices: tAI [r|, that belongs to groups (ii) and (v) by requiring calibration on a set of highly expressed genes; 
CAI El, a group (i) and (ii) index built on local statistics of codon usage and on a reference list of optimally expressed 
genes; Nc [141 ] . a group (i) index based on the number of different codons used in a coding sequence. The novel codon 
bias index we propose in this work, CompAI , is instead based on the competition of cognate and near-cognate tRNAs 
to bind to the A-site on the ribosome during translation, and is thus a group (v) index that does not need tuning on 
a reference set of highly expressed genes. While the formal definition for CompAI and the rationale behind are given 
below, we refer to the Supporting Information for the operative definition of CAI , tAI and Nc. 

Competition Adaptation Index (CompAI). It is generally accepted that translation speed depends on the 
efficiency of the codon/anticodon pairing in the A site of the ribosome [lj|. Hence, for a given codon, the rate of 
amino acid synthesis is essentially influenced by two dominant processes: the number of collisions of the corresponding 
tRNA with the ribosomal A site (which strongly depends on tRNA concentration in the cell) and the specificity of 
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the codon-anticodon pairing. Such a pairing process satisfies the Watson-Crick base-pairing rules (G-C and A-U, and 
vice versa) for the first two bases, whereas, the rule on the third (or wobble) base is more relaxed and non-standard 
pairing is allowed in some cases (wobble complementarity) [20]. Hence, there are cases in which several tRNAs pair 
with the same codon (provided that these are identical in the first two bases) and are called isoacceptor or cognate 
tRNAs. Codon-anticodon interactions are thus characterized by competition between cognate tRNA (with perfect or 
wobble complementarity between mRNA codon and tRNA anticodon), near-cognate tRNA (with a mismatch in only 
one of the first two bases) and non-cognate tRNA (with at least two mismatches). Discrimination between correct 
and wrong tRNA according to base pairing features very high fidelity (error rate / ~ 10 -2 4- 10 -4 ). Rejection of the 
wrong tRNA can occur in two distinct phases initial selection of the ternary complex EF-Tu-GTP-aa-tRNA 

and subsequent proofreading of aa-tRNA after GTP hydrolyzation. The first interaction is fast and does not depend 
on the choice of codons, in order to allow the ribosome to quickly screen for the available tRNAs. The second step 
is instead sensitive to base complementarity, featuring the first selection between cognate and near-cognate tRNAs: 
non-cognate are excluded almost immediately with / ~ 10 _1 , and then a more strict and efficient proofreading takes 
place, excluding near-cognate with / ~ 10 -2 . This means that near-cognate tRNA (unlike non-cognate tRNA) can 
enter into the interaction process between the ternary complex and the site A of the ribosome, and (when not accepted, 
in very few cases) can be rejected at the stage of initial recognition or during proofreading. In any event, this process 
results in a time delay of translation, because near-cognate rejection brings the ribosome back to the initial state of 
waiting for the correct tRNA. 

The rationale behind the definition of CompAI is precisely that of building an index which is based both on tRNA 
availability and on competition between cognate and near-cognate tRNAs that could modulate the speed of translation 
of mRNAs into proteins. Note that, since in vivo experimental determinations of tRNA concentrations are available 
only for few organisms, we will implement CompAI using the number of tRNA gene copies (tGCN) which, at least in 
simple organisms, has a hig h and positive correlation with tRNA abundance [22 h 25I | (a similar approach is adopted 
in the definition of tAI [K|). For each codon i we define its absolute adaptiveness value (Wj) as: 

mi 

EtGCNy 

-—- • (1) 

rrii 171 i 

E tGCNjj + E tGCN”/ 

3 =1 3 =1 

Here mj is the number of isoacceptor tRNA sequences (anticodons) that recognize codon i [i.e., containing either the 
anti-codon i or all its cognates that are read by i) and tGCNy is the gene copy number of the j- th of such tRNAs, 
whereas m” c is the number of tRNA sequences that are near-cognate of i and tGCN” c is the gene copy number of 
the j-th of such tRNAs (see also Fig. [Si] of Supporting Figures). The amount in square brackets represents a penalty 
introduced by the competition with near-cognate tRNAs, assuming unit or zero values in the cases of smaller and 
higher competition, respectively. This term thus assumes the role of selective constraint on the efficiency of the codon- 
anticodon coupling. Importantly, and at odds with tAI , these terms do not result from optimization on expression 
levels, but have a biological justification based on cognate/near-cognate competition. Note that, in the computation 
of Wj for a given codon, we count as isoacceptor tRNAs those with perfect or wobble base pairing that also carry 
the same amino acid of i’s anticodon. Computation of CompAI continues by defining for each codon i its relative 
adaptiveness value Wi = Wj/W maa; , where W ma x is the maximum value between all the Wj of codons. CompAI of 
gene g is finally defined as the harmonic mean of the relative adaptiveness of its codons: 

CompAIg = — (2) 

3 i 

E«E 

i =1 

The choice of the harmonic mean (rather than geometric as for CAI and tAI ) is consistent with the association of 
CompAI with the rate of protein synthesis. Indeed, the translation speed of codon i can be defined as the reciprocal 
of the concentration of the corresponding tRNA isoacceptors [26.]. Therefore, if codon i is read at a speed proportional 
to Wi, then the average translation speed of a gene is given by the harmonic mean of the {tCj} associated to its codons. 
CompAI takes values between 0 and 1, where values close to 0 (1) indicate highest (lowest) competition, and therefore 
a low (high) translation rate. 


( mi 

J2 tGCN, 


Protein-Protein Network Analysis 


In this study we use protein interaction data collected in STRING (Known and Predicted Protein-Protein Interac¬ 
tions) [l5|. In such database, each predicted interaction is assigned with a confidence level or probability w, evaluated 
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by comparing predictions obtained by the different techniques [27i-[2!)t with a set of reference associations, namely the 
functional groups of KEGG (Kyoto Encyclopedia of Genes and Genomes) (3Q|. In this way, interactions with high w 
are likely to be true positives, whereas, a low w likely corresponds to a false positive. Since the percentage of false 
positives can be very high [31|, we select a stringent cut-off 0 = 0.9 that allows a fair balance between coverage and 
interaction reliability (see the probability distribution P(w) in the left panel of Fig. (S2l of Supporting Figures). We 
thus build the protein-protein interaction network (PIN) of E.coli by placing a link between each pair of proteins 
(nodes) i,j provided that Wij > 0. The resulting number of connections or degree for a given protein i is denoted as 

h. 

To detect communities of PIN we resort to Molecular Complex Detection (MCODE) [32[. In a nutshell, MCODE 
iteratively groups together neighboring nodes with similar values of the core-clustering coefficient, which for each 
node is defined as the density of the highest k -core of its immediate neighborhood times k 50). MCODE detects 
the densest regions of the network and assigns to each found community a score that is its internal link density 
times the number of nodes belonging to it [51j . We also characterize each found community c with the mean value 
x c and standard deviation a c of codon bias values within the community, and use them to compute a 2-score as 
2 C = (x c — x n )/\Jcri + <7f (where x n and a n are, respectively, the mean value and standard deviation of codon bias 
values computed on the whole network). In this way, a value of 2 C > 1 (2 C < —1) indicates that community c features 
significantly higher (lower) codon bias than the population mean. 

Finally note that each node of PIN is photogenically classified according to the Clusters of Orthologous Groups [H] 
(COGs) of proteins [34]. COGs are generated by comparing predicted and known protein sequences in all completely 
sequenced genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across 
at least three lineages and likely corresponds to an ancient conserved domain [341 ] . 


Principal Component Analysis 


Principal Component Analysis (PCA) [35j is a multivariate statistical method to transform a set of observations 
of possibly correlated variables into a set of linearly uncorrelated variables (called principal components) spanning a 
space of lower dimensionality. The transformation is defined so that the first principal component accounts for the 
largest possible variance of the data, and each succeeding component in turn has the highest variance possible under 
the constraint that it is orthogonal to ( i.e ., uncorrelated with) the preceding components. 

We use this technique on the space of codon bias indices, so that each gene of E.coli is represented as a 4-dimensional 
vector with coordinates ( CompAI , CAI , tAI , Nc). Such coordinates are separately normalized to zero mean and 
unit variance over the whole genome. We then obtain the associated covariance matrix between the four dimensions 
of codon bias and diagonalize it. The eigenvectors of the covariance matrix, ordered according to the magnitude of 
the corresponding eigenvalues, are the principal components of the original data. 


Configuration Model 


In order to assess how significant are the codon usage patterns observed for the PIN, we need to compare the 
E.coli interactome with a suitable null model for it, i.e., an appropriate randomization of the network. Here we follow 
the most common approach in statistical mechanics of networks of using the Configuration Model (CM) j36| . The 
basic idea is to build the null model as an ensemble D of graphs with maximum entropy, except that the ensemble 
average of the node degrees are constrained to the values observed for the real network: (ki)n = hi V*. This leads to a 
probability distribution over Q which is defined via a set of Lagrange multipliers {xi} (one for each node) associated to 
the constraints [37}]. Once all {xi} are found, the CM reduces to having a link between nodes i and j with probability 
pij = , independently on all other links. Then, the null hypothesis is that any given network property x 

varies in the range (x)n A <7n[x], where both average and standard deviation of x over the ensemble can be obtained 
either analytically or numerically (by drawing sample networks from D) ,37). The number of standard deviations by 
which the empirical and expected values of x differ is given by the 2-score 2[y] = (% — (x)n)/ cr n[x] : large positive 
(negative) values of Z[x] indicate that X is substantially larger (smaller) than expected, whereas, small values signal 
no significant deviation from the null model. 
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RESULTS AND DISCUSSION 
Specificity, Essentiality and Codon Bias of E.coli genes 


Correlations between Codon Bias indices. 

As the starting point of our analysis, we first check how the different codon bias indices correlate over E.coli’s 
genome. Fig. [T] shows that, interestingly, CompAI is strongly (and positively) correlated with tAI. whereas it does 
not show any significant correlation with CAI nor with Nc. This result can be easily explained as CompAI and tAI 
elaborate on the same genetic information, that is the abundance of tRNAs, whereas CAI and Nc are based on codon 
usage statistics (see the Supporting Information). 



0,4 0,5 0,6 0,7 0,8 0,9 1 0,1 0,2 0,3 0,4 0,5 30 35 40 45 50 55 60 


CAI tAI Nc 

FIG. 1. Correlation between codon bias indices. Values of Pearson’s correlation coefficients show that CompAI is strongly 
and positively correlated with tAI (c = 0.74), but not with both CAI nor Nc (c ~ 0). 


Codon Bias and ERI. 

We move further and analyze the correlation between the various codon bias indices and the evolutionary retention 
index (ERI) [18!] for E.coli genes (we recall that a gene with a low ERI value is peculiar to E.coli, whereas a gene with 
high ERI is shared among different species). Fig. [2] reports the average values and standard deviations of the codon 
bias indices for every group of genes having the same ERI value. Interestingly, the evolutionary codon adaptation 
measured by CompAI and tAI tends to increase for genes that are less specific to E.coli. Fig. [2] also suggests that it 
is possible to make a threefold separation of genes by looking at the rate of variation of tAI and CompAI with ERI. 
We thus identify group A (ERI < 0.2: 1597 low ERI genes that are specific to E.coli), group B (0.2 < ERI < 0.9: 
1804 intermediate ERI genes) and group C (ERI > 0.9: 231 high-ERI genes that are highly conserved and shared 
among several bacterial species). In each group, the correlation between codon bias and ERI is maximized (see the 
corresponding correlation coefficients in the figure). CAI and Nc are instead less structured with respect to ERI, as 
shown by the very small correlation coefficients (and by the impossibility to identify gene groups). 


Codon Bias and Gene Essentiality. 

We now study the patterns of codon usage bias in essential and non-essential genes, according to the classification 
scheme of Gerdes et al. [l|j (see Materials and Methods). As a preliminary result, Fig. [3] reports the percentage of 
essential genes in each set of genes sharing the same ERI. We see that the three groups A, B, C of genes identified as 
in the previous paragraph feature different percentages of essential genes: approximately, 10% for group A, 15% for 
group B and above 30% for group C. Essentiality and ERI thus seems to capture similar genetic features. Fig. [Ijshows 
instead that CompAI and tAI are more sensitive than CAI and Nc in distinguishing essential from non-essential 
genes. Overall, Figs. [2] and [I] provide a clear indication that codon bias, as measured by t.AI and CompAI , is more 
pronounced for genes that are highly conserved (be., with high ERI) and essential, on the other hand CAI and Nc 
are less sensitive to these quantities. 


COGs. 

We now perform a kind of gene ontology to check how the three gene groups A, B, C are projected over the clusters 
of orthologous genes (COGs) and their functional annotations [3J|. To this end, for each group we evaluate the 
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FIG. 2. Correlation between the various codon bias indices and ERI. Codon bias average values and standard deviation 
(error bars) are determined for each set of E.coli genes having the same ERI value. In each panel, the solid lines are linear 
regression fits, with c denoting the corresponding correlation coefficients. In the left panels, the fits are performed separately 
for the three groups of genes A (ERI < 0.2), B (0.2 < ERI < 0.9) and C (ERI > 0.9). Both CompAI and tAI monotonously 
increase with ERI, whereas CAI and Nc show a low correlation with ERI. 



FIG. 3. Essentiality for E.coli genes. The percentage of essential genes is reported for each set of genes sharing the same 
ERI. Horizontal solid lines represent average values of essentiality percentage for each group A, B, C of genes (defined by a 
maximum correlation between CompAI -ERI and fAJ-ERI). The groups have different incidences of essential genes: 10% for 
group A (ERI < 0.2), 15% for group B (0.2 < ERI < 0.9) and more than 30% for group C (ERI > 0.9). 


Bayesian probability that its genes belong to a given COG: Pr(COG|group) = Pr(group|COG)Pr(COG)/Pr(group), 
where Pr(group) is estimated as the fraction of the genome belonging to the group, Pr(COG) as the fraction of the 
genome belonging to the COG and Pr(group|COG) is the fraction of genes in the COG that belong to a particular 
group. Fig. [5]shows the histogram of Pr(COG|group) over the 17 COGs, for the three groups A, B, C defined above. 
Assuming an arbitrary discriminating threshold of 10%, we observe that each group is prevalently projected over a 
limited set of COGs (reported in the legend box of Fig. [5|. Group A genes (those with low ERI values) mostly insist 
over COGs K and G (transcription, carbohydrate metabolism); group B (genes with intermediate ERI) is enriched in 
COGs G and E (again, carbohydrate metabolism, amino acid metabolism and transport); finally, group C (genes with 
the highest ERI) is dominated by the functional annotations associated with COGs J and L (translation, ribosome 
structure and biogenesis, replication, recombination and repair). Indeed, group C, composed of the highly adapted, 
essential, and conserved genes of E.coli, is the set of genes that code for ribosomal proteins. 
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FIG. 4. Codon bias indices for essential and non-essential genes. Error bars are standard deviations within each group. 
Then mean value of codon bias is systematically higher for essential genes, however, only CompAI and tAI can effectively 
separate essential from non-essential genes. In fact, in the left panels the average codon bias values for essential and non- 
essential genes have a relative variation of about 5%, whereas, in the right panels such values are almost coincident and the 
errors overlap. 



FIG. 5. Histogram of Pr(COG|group) over the COGs for the three gene groups A, B, C. Each group is characterized 
by one or a few predominant COGs, indicated within parenthesis in the legend (assuming a threshold of 0.1 and excluding 
generic COGs R and S, for which function prediction is too general or missing). 

























































Codon Bias and the Connectivity Patterns of E.coli’s Protein Interaction Network 


Communities. 


We now turn our attention to the network of interacting proteins in E.coli. We start by studying codon bias in 
relation with the connectivity patterns of the network. First, note that the degree distribution of proteins is scale-free 
(see the right panel of Fig. IS2I of Supporting Figures), meaning that the network features a large number of poorly 
connected proteins and a relatively small number of highly connected hubs. Fig. [6] notably shows that these hub 
proteins are systematically characterized by higher values of codon bias of the corresponding genes—when this is 
measured by tAI and CompAI. CAI and Nc are instead clearly less sensitive to protein connectivity. 



FIG. 6. Relation between the various codon bias indices of genes and the degree k of the corresponding proteins 

in the PIN of E.coli. Solid lines are linear fits. CompAI and tAI of a gene definitely increase with the connectivity of the 
corresponding protein in the PIN, whereas the other two indices are less sensitive to this parameter. 


We move further and consider codon bias in relation with the community structure of the PIN. We recall that a 
community is a group of proteins that are more densely connected within each other than with the rest of the network. 
Table U shows the features of the communities that are assigned by MCODE a score higher than 10, together with 
their COG composition, average degree and, for ERI and the various codon bias indices, the internal average value x c 
and the Z-scores (comparing the distribution of bias inside the community with that of the whole network). We see 
that such topologically determined communities, ordered by score, are evolutionarily and functionally characterized 
by a dominant COG, shared by the majority of the proteins in the community. This suggests that the identified 
communities can be associated with specific metabolic functions: they correspond to functional modules, essential for 
the life-cycle of the organism. 

Let us focus on the first community, that includes only 60 proteins (4.5% of the whole network) but as much as 
32.6% of the total number of links in the network, and that basically overlaps with the main core of the PIN (be., the 
k -core with the highest possible degree). Notably, proteins belonging to this community have on average a codon bias 
index (as measured by tAI and, even more, by CompAI ) that is significantly higher than the average of the rest of 
the network (the zGscore is bigger than 1). As noticed above, this core is essentially composed of ribosomal proteins, 
that are usually highly expressed, have the highest codon usage bias, and are broadly conserved and essential across 
different taxa 13811. 
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TABLE I. Features of top-scoring communities. Number of nodes (n), community score (n times the internal density), 
predominant COG label and percentage; then, for ERI and the various codon bias indices, mean values x c internal to the 
community and Z scores (between square brackets). Values of Z > 1 are reported in bold. 


ID 

n 

score 

< k > 

COG 

ERI 

CompAI 

CAI 

tAI 

Nc 

1 

60 

54.9 

63.15 

J (90.0%) 

0.91 

[1.66] 

0.13 

[1.40] 

0.75 

[0.05] 

0.38 

[1.35] 

49.16 

[-0.06] 

2 

31 

28.6 

35.03 

N (74.2%) 

0.38 

[0.21] 

0.08 

[-0.35] 

0.75 

[0.07] 

0.32 

[0.14] 

49.88 

[0.12] 

3 

21 

19.1 

25.85 

C (97.6%) 

0.53 

[0.65] 

0.09 

[0.38] 

0.74 

[-0.13] 

0.34 

[0.72] 

50.18 

[0.2] 

4 

15 

13.9 

18.40 

M (66.7%) 

0.82 

[1.31] 

0.09 

[0.07] 

0.75 

[0.15] 

0.31 

[0.07] 

49.32 

[-0.02] 

5 

13 

11.7 

10.77 

P (76.9%) 

0.20 

[-0.29] 

0.08 

[-0.26] 

0.77 

[0.40] 

0.33 

[0.54] 

48.57 

[-0.22] 

6 

12 

11.5 

11.50 

U (48.9%) 

0.20 

[-0.29] 

0.07 

[-0.82] 

0.76 

[0.26] 

0.28 

[-0.63] 

48.92 

[-0.12] 

7 

11 

10.6 

19.82 

P (63.6%) 

0.56 

[0.70] 

0.09 

[0.44] 

0.76 

[0.26] 

0.34 

[0.72] 

48.74 

[-0.14] 

8 

10 

10.0 

11.60 

C (75.0%) 

0.04 

[-0.86] 

0.07 

[-0.66] 

0.76 

[0.28] 

0.29 

[-0.45] 

47.78 

[-0.30] 


Principal Component Analysis. 


Finally we perform PC A over the space of the four codon bias indices ( CompAI , CAI : tAI , Nc ) measured for each 
E.coli gene. The two first principal components (PCI and PC2) turn out to represent for as much as 85% of the 
total variance of codon bias over the genome (left plot of Fig. 0. Projection of the first two principal components 
on the individual codon bias indices (loadings) shows that none of the four indices predominantly contributes to the 
data variability (right plot of Fig. 0. Thus, the placement of a gene in the PC1-PC2 plane depends on a weighted 
contribution of all the indices. Interestingly, the genes encoding for the proteins of the eight top MCODE communities 
are well localized and separated in this reduced space (Fig. [Sj) . In particular, the first community (be., the core of 
ribosomal proteins characterized by high values of both CompAI and tAI) is located in the upper left part of the 
graph, isolated from the others. This represents an important evidence: proteins that belong to the densest connected 
cores of the interactome are well-localized in the space of the two principal components. In other words, if a set of 
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FIG. 7. Left plot: Eigenvalues of the correlation matrix between the codon bias indices on expressed sequences. 
Right plot: Projection of the first two PCA components on the individual codon bias indices. Recalling that Nc 
is anticorrelated with the other codon bias indices, PCI results from a weighted and coherent contribution of all the indices, 
whereas, for PC2 the contribution of CompAI and tAI is opposite to that of CAI and Nc. 


proteins are physically and functionally connected in a module, then their corresponding genes should share common 
codon bias features. Conversely, we can obtain an estimate for the conditional probability Pr(link|<i) of a functional 
interaction between proteins, provided that their relative genes fall within a distance d in the plane of the two principal 
components PCI and PC2. Reasonably, we compare Pr(link|d) estimated on the real interactome with ( Pr(link\d))n 
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FIG. 8. Centroids of the top MCODE communities in the space of the first two PCA components. The error 
bars denote the variance of the centroids. 


estimated on the Configuration Model (CM) which, we recall, is a degree-conserving randomization (re-wiring) of 
the network. Fig. [9] shows the Z-score for Pr(link|d) as a function of d , and reveals a peculiar behavior: for small 
distances (d < 2) the probability of finding a connection between two proteins is much higher than what could have 
been expected from a (degree-conserving) random link placement. Conversely, for medium distances (3 < d < 9), 
the linking probability is lower than that of the CM, whereas, the real network and the CM become compatible for 
large distances, where, however, connections are rather few. This analysis shows that sets of genes sharing similar 
codon usage patterns encode for proteins that are much more likely to interact than in situations where chance alone 
is responsible for the structure of the interactome. 



FIG. 9. Histogram of the Z-score for J- > r(link|d) for each pair of genes and their respectively encoded proteins, d 

is the Euclidean distance between pairs of genes in the space of the first two PCA components, and Pr(link|d) is the conditional 
probability of having a link in the PIN between two proteins given that their encoding genes are localized within a distances 
d in the PC1-PC2 plane. The Z-score is obtained as Z[Pr(link|d)] = (Pr(link|d) — (Pr(link|d))fi]/crn[Pr(link|d)]. The gray 
dashed lines mark the significance interval of ±3cr. 


CONCLUSION 

In this work we have introduced CompAI , a novel codon bias index that is inspired by tAI , though conceptually 
distinct. In fact, CompAI does not make reference to lists of highly expressed genes, and is thus unsupervised and 
based on intrinsic information about co-evolution of genes that code for proteins and tRNAs. Conceptually, the 
definition of CompAI is based on a model that postulates a competition between cognate and near-cognate tRNAs 
for the same codon, exposed on the ribosome at each step of protein synthesis. Competitive mechanisms in the 
machinery of ribosomal translation of genes into proteins have been repeatedly suggested and studied in the literature 
[39l | and deserve further attention in order to understand their role for translation efficiency. In particular, CompAI 
was designed in order to provide information about the speed of protein synthesis, being based on proofreading delay 
mechanisms. 
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Our genome-wide analysis of codon bias in E.coli using CompAI as well as other commonly used indices revealed 
that codon usage metrics resting on counting tRNA genes (CompAI and tAI) are strongly and positively correlated 
among themselves—in spite of their conceptually different definition. It would then be quite interesting to check in the 
future whether this correlation is specific to E.coli or it is universally observed in the genomes of bacterial species that 
are either ecologically and evolutionarily close or, by contrast, very far from E.coli. We also found that both CompAI 
and tAI correlate with ERI, the degree of conservation for a gene among similar species, and gene essentiality, whereas, 
CAI and Nc are less sensitive to these quantities. CompAI and tAI values also allow to distinguishing three groups of 
genes, that are differently characterized by both codon choice adaptation, ERI and degree of essentiality, and that also 
feature specific predominant COG signatures. In particular the third group (C), composed of the few genes that are 
highly conserved and with the strongest codon bias adaptation, consists for 30% of essential genes with predominant 
COGs J and L—that refer to translation, ribosome structure and biogenesis, replication, recombination and repair. 
These represent house-keeping and control functions that must be continuously executed by the cell, meaning that the 
genes responsible for them have to be expressed most of the time during the cell cycle. These observations strongly 
support the idea that an increasing selection of codons and, in parallel, a correlated modulation of tRNA availability 
co-evolved along the evolutionary history of a species. 

Finally, we have addressed a theme as relevant as the connection between codon usage bias and protein functional 
or physical interactions. Our main result indicates that, in the course of the evolution of a genome, the functional 
structuring of the complex of interactions between proteins has interfered with the peculiar codon-coding formulation 
of the corresponding genes. In particular we have shown here, for the first time to our knowledge, that communities 
of highly connected proteins in the interactome of E.coli correspond to encoding genes that share the same degree of 
evolutionary adaptation, as expressed by codon bias indices that synthetically represent genetic information encoded 
in the tRNAs sector of the genome. Indeed, CompAI , that is based on a simple representation of tRNA competition, 
seems to detect the codon bias signal behind communities more consistently than the other indices here considered. 
Conversely, we have provided evidence that if two genes have similar codon usage patterns then the corresponding 
proteins have a significant probability of being functionally connected and interacting. This result points out that 
codon bias should be a relevant parameter in the fundamental problem of predicting unknown protein-protein in¬ 
teractions from genomic information. This study is a first exploratory step towards a more complete investigation 
on how communities within protein-protein interaction networks rest on a consistent but still to be decoded codon 
bias signal. Indeed, the connection of the topolo gy o f a network with an underneath semantics is far from trivial, as 
recently pointed out in the specialized literature 4C|. Biological PINs and codon bias offer an interesting case study 
worth to be investigated in a wider perspective |41| . 


SUPPORTING INFORMATION 


Here we define and explain the rationale behind the various codon bias indices that have been proposed in the 
literature. 

Codon Adaptation Index (CAI) [l2| . The pattern of codon usage is ruled by two simultaneous effects j42j: 
translational selection towards optimal codons for each amino acid, and genetic drift that allows the persistence of 
non-optimal codons. It is natural to assume that selection is stronger for codons of highly expressed genes, which 
thus feature a more pronounced bias in the use of codons. The principle behind CAI is exactly that codon usage in 
highly expressed genes can reveal the optimal (?.e., most efficient for translation) codons for each amino acid. Hence, 
CAI builds on a reference set of highly expressed genes to assess, for each codon i, the relative synonymous codon 
usages ( RSCUi) and the relative codon adaptiveness (wi): 


RSCUi = 


X.i 



RSCUi 

max {RSCUjY 
0 = 


(SI) 


In the RSCUi , Xi is the number of occurrences of codon i in the genome, and the sum in the denominator runs 
over the rii synonyms of i; RSCUs thus measure codon usage bias within a family of synonymous codons. Wi is then 
defined as the usage frequency of codon i compared to that of the optimal codon for the same amino acid encoded by 
i — i.e., the one which is mostly used in a reference set of highly expressed genes. 

The CAI for a given gene g is calculated as the geometric mean of the usage frequencies of codons in that gene, 
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normalized to the maximum CAI value possible for a gene with the same composition of amino acid: 


CAI g = 



(S2) 


where the product runs over the l g codons belonging to that gene (except the stop codon). The critical aspect in 
the definition of CAI is that it requires to define a priori reference set of highly expressed genes that is specific for a 
given organism. CAI is then not always transferable; yet, since it is tuned on highly expressed genes, it is generally 
well correlated with gene expression levels in genomes for which reference gene sets are available. In this work, CAI 
for E.coli genes was computed using the DAMBE 5.0 package [43] 

tRNA Adaptation Index (tAI ) [13 1. The speed of protein synthesis is bound to the waiting time for the correct 
tRNA to enter the ribosomal A site [44j|. and thus depends on tRNA concentrations [45| (and, indirectly, on gene copy 
numbers). The consequent adaptation of codon usage to tRNA availability [22|,|4t| is at the basis of tAI, which follows 
the same mathematical model of CAI defining for each codon i its absolute (Wi) and relative (wi) adaptiveness 
value: 


Wi = ^(l-««)tGCNy; 

3 =1 


Wi/W max if W t ± 0 

Wmean otherwise ’ 


(S3) 


where mi is the number of isoacceptor tRNAs that recognize codon i ( i.e., tRNAs that carry the same aminoacid 
that is encoded by i), tGCN ?:j is the gene copy number of the j'-tli tRNA that recognizes the i-th codon, Sij is a 
selective constraint on the efficiency of the codon-anticodon coupling, W max is the maximum Wi value and w mean is 
the geometric mean of all Wi with Wi ^ 0. 

The tAI of gene g is eventually defined as the geometric mean of the relative adaptiveness values of its codons, thus 
estimating the amount of adaptation of gene g to its genomic tRNA pool: 


tAIg — 



(S4) 


The critical issue for tAI is the selection of a meaningful set of Sij values, i.e., weights that represent wobble inter¬ 
actions between codons and tRNAs. Assuming that tRNA usage is maximal for highly expressed genes, these values 
are chosen in order to optimize the correlation of tAI values with expression levels—exactly as CAI. Besides, while 
the efficiencies of the different codon-tRNA interactions are expected to vary among different organisms, values 
are based on the gene expression in Saccharomyces cerevisiae [131 ]— thus lacking universality [47]. In this work we 
have evaluated tAI values of E.coli genes using the CodonR package [48]. 


Effective Number of Codons ( Nc ) [HI- Nc is a measure that quantifies the departure of a gene from the 
random usage of synonymous codons. Given a sequence of interest, the computation of Nc [49] starts from the 
quantity—defined for each family a of synonymous codons: 


Tfbry 

F CF a = 

k=l 



(S5) 


where m a is the number of codons in a (each appearing m a , ri 2 a ,..., n mc , times in the sequence) and n a = Y ™=l nk a ■ 
Nc then weights these quantities in order to measure amount of entropy in the codon usage of the sequence: 


Nc = N s + 


k 2 

k 2 Y n ° 

a=l 


k 3 

k 3 y n o 

a=l 


K 4 

k 4 y n o 

ot—1 


Y ( n aFcF a ) Y ( n a FcF a ) Y ( n a FcF a ) 

ct—1 ot—1 ot —1 


(S6) 


where Ns is the number of families with one codon only and K m is the number of families with degeneration m 
(families with degeneration 6 are divided into two families of degeneration 2 and 4, as they often are subject to 
different selective forces). Note that Nc reaches is maximal value (61) when all codons are used equally and its 
minimal value (23) when only one codon is used per amino acid (extreme bias). Differently from both CAI and 
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tAI , Nc is a more immediate measure of codon usage that does not require any a priori information nor makes any 
biological hypothesis (which constitute its weakness and, and the same time, its strength). Yet, since the effect of 
selection is a reduction of entropy for codon usage in a sequence, Nc provides a reliable measure for this effect. In 
this paper we have obtained Nc values through DAMBE 5.0 [43j . 


SUPPORTING FIGURES 


■ cognate 



anti-codon 


FIG. SI. Abundance of tGCN cognate and near cognate (according to Watson-Crick base pairing) for each 
anti-codon in E.coli. Data taken from [TiJ 
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FIG. S2. Left plot: Distribution of confidence levels Q(w), with the vertical line indicating the cut-off we use to separate 
true from false positives. Right plot: Distribution of degrees P(k ) when 0 = 0.9, with the insets showing the same 
distribution for the original network (0 = 0). 
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